Agenda

• 1 Hitman frame

• DirectX 12 Implementation

• DirectX 12 vs. DirectX 11 Performance

Glacier

• No precomputation

• Fast iteration

• Dynamic time of day 

• Fixed on level startup 

• Probe based reflections 

• Generated on level load 

• Probes also used for ambient 

• Tiled Deferred


GDC 

1 Frame

• 3500 Draw Calls

• 8000 Instances

Light

Macro

Light Tiles

Probes

• Reflections 

• Ambient 

Transparent

Atmospheric Scattering

DirectX 12 Goals

• Goals:

• Improve CPU Performance

• Improve GPU Performance with Async compute

• Not a rewrite: 

• Still supporting DirectX 11 


Temp Allocator 

• DX12 requires lots of temporary resources 

• Need a fast, multithreaded allocator 

• Ours is similar to cgyrling [0]

• Large locked allocator maintains blocks

. 1 per resource type

• Small lock-free allocators claim blocks of resources

. 1 per thread per resource type

• Fences control when blocks can be reused

[0] http://www.gdcvault.com/play/1022186/Parallelizing-the-Naughty-Dog-Engine 
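
The two-level scheme above can be sketched roughly as follows. This is a minimal illustration, not the Glacier implementation: `Block`, `BlockPool` and `ThreadAllocator` are hypothetical names, a block is just an offset range into some backing resource, and fence values are plain integers passed in by the caller instead of being read from an ID3D12Fence.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

// One fixed-size range of a temporary resource (upload memory, descriptors, ...).
struct Block {
    uint32_t offset;  // start of the range inside the backing resource
    uint32_t size;    // number of units in the range
};

// Large, locked allocator: one per resource type. Hands out whole blocks.
class BlockPool {
public:
    BlockPool(uint32_t blockSize, uint32_t blockCount) {
        for (uint32_t i = 0; i < blockCount; ++i)
            free_.push_back(Block{i * blockSize, blockSize});
    }

    // Claim a block. completedFence is the last fence value the GPU has passed.
    bool claim(uint64_t completedFence, Block& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        // Recycle retired blocks whose fence the GPU has already reached.
        while (!retired_.empty() && retired_.front().fence <= completedFence) {
            free_.push_back(retired_.front().block);
            retired_.pop_front();
        }
        if (free_.empty()) return false;
        out = free_.back();
        free_.pop_back();
        return true;
    }

    // Return a block; it may be reused only after the GPU passes fenceValue.
    void retire(const Block& block, uint64_t fenceValue) {
        std::lock_guard<std::mutex> lock(mutex_);
        retired_.push_back(Retired{block, fenceValue});
    }

private:
    struct Retired { Block block; uint64_t fence; };
    std::mutex mutex_;
    std::vector<Block> free_;
    std::deque<Retired> retired_;  // assumed to arrive in increasing fence order
};

// Small allocator: one per thread per resource type, so the fast path takes no lock.
class ThreadAllocator {
public:
    explicit ThreadAllocator(BlockPool& pool) : pool_(pool) {}

    // Bump-allocates count units. Returns false if no block is available.
    bool allocate(uint32_t count, uint64_t frameFence, uint64_t completedFence,
                  uint32_t& outOffset) {
        if (!hasBlock_ || used_ + count > block_.size) {
            if (hasBlock_) pool_.retire(block_, frameFence);
            hasBlock_ = pool_.claim(completedFence, block_);
            used_ = 0;
            if (!hasBlock_ || count > block_.size) return false;
        }
        outOffset = block_.offset + used_;
        used_ += count;
        return true;
    }

private:
    BlockPool& pool_;
    Block block_{0, 0};
    uint32_t used_ = 0;
    bool hasBlock_ = false;
};
```

Only `claim` and `retire` touch the mutex, so the lock is taken once per block rather than once per allocation.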

Temp Resource types

• Upload Memory 

• Constant buffers 

• Descriptors 

. CBV 
. UAV 
. SRV 



Root signature

CD3DX12_DESCRIPTOR_RANGE R[10];

R[0].Init(D3D12_DESCRIPTOR_RANGE_TYPE_SRV, 18, 0, 0); // 0-18
R[1].Init(D3D12_DESCRIPTOR_RANGE_TYPE_CBV, 8, 0, 0);
R[2].Init(D3D12_DESCRIPTOR_RANGE_TYPE_SRV, 18, 0, 0);
R[3].Init(D3D12_DESCRIPTOR_RANGE_TYPE_CBV, 8, 0, 0);
R[4].Init(D3D12_DESCRIPTOR_RANGE_TYPE_SRV, 18, 0, 0);
R[5].Init(D3D12_DESCRIPTOR_RANGE_TYPE_CBV, 8, 0, 0);
R[6].Init(D3D12_DESCRIPTOR_RANGE_TYPE_SRV, 18, 0, 0);
R[7].Init(D3D12_DESCRIPTOR_RANGE_TYPE_CBV, 8, 0, 0);
R[8].Init(D3D12_DESCRIPTOR_RANGE_TYPE_SRV, 17, 15); // 19-31
R[9].Init(D3D12_DESCRIPTOR_RANGE_TYPE_SAMPLER, 16, 0, 0);

CD3DX12_ROOT_PARAMETER Slot[10];

Slot[0].InitAsDescriptorTable(1, &R[0], D3D12_SHADER_VISIBILITY_PIXEL);
Slot[1].InitAsDescriptorTable(1, &R[1], D3D12_SHADER_VISIBILITY_PIXEL);
Slot[2].InitAsDescriptorTable(1, &R[2], D3D12_SHADER_VISIBILITY_VERTEX);
Slot[3].InitAsDescriptorTable(1, &R[3], D3D12_SHADER_VISIBILITY_VERTEX);
Slot[4].InitAsDescriptorTable(1, &R[4], D3D12_SHADER_VISIBILITY_HULL);
Slot[5].InitAsDescriptorTable(1, &R[5], D3D12_SHADER_VISIBILITY_HULL);
Slot[6].InitAsDescriptorTable(1, &R[6], D3D12_SHADER_VISIBILITY_DOMAIN);
Slot[7].InitAsDescriptorTable(1, &R[7], D3D12_SHADER_VISIBILITY_DOMAIN);
Slot[8].InitAsDescriptorTable(1, &R[8], D3D12_SHADER_VISIBILITY_ALL);
Slot[9].InitAsDescriptorTable(1, &R[9], D3D12_SHADER_VISIBILITY_ALL);



GAME DEVELOPERS CONFERENCE, March 14-18, 2016. Expo: March 16-18, 2016. #GDC16



Root signature

• Per Stage

. 18 SRVs
. 8 CBVs

• 15 shared SRVs

• 16 shared samplers




Descriptor burn

• Per draw descriptor usage:

. 36 for SRV

. 16 for CBV

• 520k Descriptors for a 10k draw frame 

• Writing that many descriptors is slow 

• Requires multiple descriptor heaps 
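
The 520k figure follows directly from the per-draw counts. A small compile-time check of the arithmetic, using only the numbers on the slide (the constant and function names are made up for illustration):

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kSrvPerDraw = 36;  // per-draw SRV descriptors, from the slide
constexpr uint32_t kCbvPerDraw = 16;  // per-draw CBV descriptors, from the slide

// Descriptors written per frame when every draw gets a full set of tables.
constexpr uint32_t descriptorsPerFrame(uint32_t drawCount) {
    return drawCount * (kSrvPerDraw + kCbvPerDraw);
}

static_assert(descriptorsPerFrame(10000) == 520000, "52 descriptors x 10k draws");
```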

Descriptor burn

• Example:

• SRV descriptors, one stage, three draw calls

• Naive way

[Figure: descriptor heap entries for Draw 0, Draw 1, Draw 2]

Observation: Not all entries are used

Descriptor burn

• Solution: Allow overlap

• Only put in descriptors actually used by the shader

• Restricts Descriptor heap type 

• Pad with Null descriptors 

• Only on submit 
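
One way to picture the overlap idea is the sketch below. It is a simulation under stated assumptions, not the talk's code: integers stand in for descriptors, `-1` stands in for a null descriptor, each draw is assumed to use a contiguous prefix of its 18-entry SRV table, and `PackedHeap` and `packOverlapped` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t kTableSize = 18;   // SRVs per stage in the root signature
constexpr int kNullDescriptor = -1;   // stand-in for a null SRV

struct PackedHeap {
    std::vector<int> entries;          // simulated descriptor heap contents
    std::vector<uint32_t> tableStart;  // table base offset for each draw
};

// Writes only the descriptors each draw actually uses, so the declared
// 18-entry tables of consecutive draws overlap in the heap.
PackedHeap packOverlapped(const std::vector<std::vector<int>>& usedPerDraw) {
    PackedHeap heap;
    for (const auto& used : usedPerDraw) {
        assert(used.size() <= kTableSize);
        heap.tableStart.push_back(static_cast<uint32_t>(heap.entries.size()));
        heap.entries.insert(heap.entries.end(), used.begin(), used.end());
    }
    // On submit only: pad with nulls so the last table's full declared range
    // lies inside the heap. Earlier tables start lower, so they fit as well.
    if (!heap.tableStart.empty()) {
        const std::size_t needed = heap.tableStart.back() + kTableSize;
        if (heap.entries.size() < needed)
            heap.entries.resize(needed, kNullDescriptor);
    }
    return heap;
}
```

With three draws using 3, 5 and 2 SRVs this layout needs 26 entries, where the naive layout would write 3 x 18 = 54.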



[Figure: overlapping descriptor ranges for three draws]

Pipeline State Objects 

• Our interface is still DX11 based 

• Programmers prefer this 

• PSOs handled internally 

• Store an array of PSOs with the Pixel Shader

• State is hashed into a 128-bit key

• Every object has a runtime unique id

• Assigned & deduplicated on creation

• Makes the hashing a no-op
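
A rough sketch of why deduplicated ids make the hashing trivial: once every state object carries a small unique id, the key is just those ids packed side by side. Everything here is illustrative. `StateIdTable`, `makeKey` and `PsoCache` are hypothetical names, the key covers only four of the states, and a `std::map` stands in for the per-pixel-shader array mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>

// Hands out small runtime ids; identical descriptions share one id.
class StateIdTable {
public:
    uint32_t getOrCreate(const std::string& description) {
        auto it = ids_.find(description);
        if (it != ids_.end()) return it->second;
        const uint32_t id = static_cast<uint32_t>(ids_.size());
        ids_.emplace(description, id);
        return id;
    }
private:
    std::unordered_map<std::string, uint32_t> ids_;
};

// 128-bit key as two 64-bit words (illustrative layout only).
using PsoKey = std::pair<uint64_t, uint64_t>;

PsoKey makeKey(uint32_t vertexShader, uint32_t pixelShader, uint32_t blendState,
               uint32_t depthStencilState) {
    // Ids are assumed to fit in 12 / 12 / 8 / 8 bits.
    assert(vertexShader < (1u << 12) && pixelShader < (1u << 12));
    assert(blendState < (1u << 8) && depthStencilState < (1u << 8));
    const uint64_t lo = uint64_t(vertexShader) | (uint64_t(pixelShader) << 12) |
                        (uint64_t(blendState) << 24) |
                        (uint64_t(depthStencilState) << 32);
    return PsoKey(lo, 0);  // second word left for the remaining state ids
}

// Cache of created pipeline states; the int stands in for a PSO pointer.
class PsoCache {
public:
    int getOrCreate(const PsoKey& key) {
        auto it = cache_.find(key);
        if (it != cache_.end()) return it->second;
        const int pso = nextPso_++;  // a real renderer would create the PSO here
        cache_.emplace(key, pso);
        return pso;
    }
    std::size_t size() const { return cache_.size(); }
private:
    std::map<PsoKey, int> cache_;
    int nextPso_ = 0;
};
```

Because equal states already share an id, building the key is a handful of shifts and ORs with no hashing of the state contents at draw time.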



Pipeline State Objects

struct SPipelineStateObjectHash
{
    union
    {
        struct
        {
            uint64 VertexShader       : RENDER_SHADER_BITS;             // 12  4k
            uint64 PixelShader        : RENDER_SHADER_BITS;             // 12  4k
            uint64 InputLayout        : RENDER_INPUT_LAYOUT_BITS;       //  6  64
            uint64 RasterizerState    : RENDER_RASTERIZER_STATE_BITS;   //  7  128
            uint64 BlendStateState    : RENDER_BLEND_STATE_BITS;        //  8  256
            uint64 DepthStencilState  : RENDER_DEPTHSTENCIL_STATE_BITS; //  8  256
            uint64 RenderTargetFormat : RENDER_TARGET_FORMAT_BITS;      //  4  16
            uint64 Topology           : RENDER_TOPOLOGY_BITS;           //  3  8
            uint64 DomainShader       : RENDER_SHADER_BITS;             // 12  4k
            uint64 HullShader         : RENDER_SHADER_BITS;             // 12  4k
            uint64 _Pad               : 128 - 84;                       // = 84
        };
        struct
        {
            uint64 nHash0;
            uint64 nHash1;
        };
    };
};

Multithreading

• Want to submit command lists before they are finished

• Allows more parallelism 

• Async Command Lists 

• Not available in DirectX 12 

• Easy to emulate 

• Push all Command Lists into a queue 

• Submit in order as they finish 
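
The emulation described above might look something like this. It is a minimal sketch with hypothetical names (`OrderedSubmitQueue`, `push`, `markFinished`); integers stand in for command lists, and appending to `submitted_` stands in for the actual ExecuteCommandLists call.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Command lists are queued in the intended GPU order when recording starts and
// are submitted in that same order as soon as they (and all earlier ones) finish.
class OrderedSubmitQueue {
public:
    // Reserve a slot at record start; the returned index fixes the submit order.
    std::size_t push(int commandListId) {
        std::lock_guard<std::mutex> lock(mutex_);
        entries_.push_back(Entry{commandListId, false});
        return entries_.size() - 1;
    }

    // Called by a worker thread once recording of its list is complete.
    void markFinished(std::size_t slot) {
        std::lock_guard<std::mutex> lock(mutex_);
        entries_[slot].finished = true;
        // Submit the longest finished prefix that has not been submitted yet.
        while (next_ < entries_.size() && entries_[next_].finished) {
            submitted_.push_back(entries_[next_].id);  // stand-in for the real submit
            ++next_;
        }
    }

    // Snapshot of what has been handed to the GPU queue so far.
    std::vector<int> submitted() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return submitted_;
    }

private:
    struct Entry { int id; bool finished; };
    mutable std::mutex mutex_;
    std::vector<Entry> entries_;
    std::vector<int> submitted_;
    std::size_t next_ = 0;
};
```

A list that finishes early simply waits in the queue until every list ahead of it has been submitted, so GPU order stays deterministic while recording runs in parallel.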



Async Compute

• Overlap independent work

. SSSAA
. SSAO

• Light Tile Calculations 

Async Compute

• Graphics Queue: Write Fence 

• Graphics Queue: Render Shadows 

• Compute Queue: Wait for Fence 

• Compute Queue: Execute Async work 

• Compute Queue: Write Fence 

• Graphics Queue: Wait for Fence 
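
The fence handshake above can be simulated without a GPU. In D3D12 the corresponding calls are ID3D12CommandQueue::Signal and ID3D12CommandQueue::Wait; the sketch below only models their ordering effect. `Fence`, `Op`, `Queue` and `run` are hypothetical stand-ins, and the final "consume async results" step is an assumed example of work that depends on the compute output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct Fence { uint64_t value = 0; };

struct Op {
    enum Kind { Signal, Wait, Work } kind;
    Fence* fence;         // used by Signal and Wait
    uint64_t fenceValue;  // used by Signal and Wait
    std::string name;     // used by Work
};

struct Queue {
    std::vector<Op> ops;
    std::size_t next = 0;
    bool done() const { return next == ops.size(); }
};

// Steps both queues in turn; a queue stalls on Wait until the fence is reached.
std::vector<std::string> run(Queue& graphics, Queue& compute) {
    std::vector<std::string> log;
    Queue* queues[] = {&graphics, &compute};
    bool progressed = true;
    while (progressed) {
        progressed = false;
        for (Queue* q : queues) {
            if (q->done()) continue;
            const Op& op = q->ops[q->next];
            if (op.kind == Op::Wait && op.fence->value < op.fenceValue) continue;
            if (op.kind == Op::Signal) op.fence->value = op.fenceValue;
            if (op.kind == Op::Work) log.push_back(op.name);
            ++q->next;
            progressed = true;
        }
    }
    return log;
}

// The sequence from the slide: shadows on graphics overlap the async work.
std::vector<std::string> runShadowOverlapFrame() {
    Fence startAsync, asyncDone;
    Queue graphics, compute;
    graphics.ops = {
        {Op::Signal, &startAsync, 1, ""},
        {Op::Work, nullptr, 0, "graphics: render shadows"},
        {Op::Wait, &asyncDone, 1, ""},
        {Op::Work, nullptr, 0, "graphics: consume async results"},
    };
    compute.ops = {
        {Op::Wait, &startAsync, 1, ""},
        {Op::Work, nullptr, 0, "compute: async work"},
        {Op::Signal, &asyncDone, 1, ""},
    };
    return run(graphics, compute);
}
```

The first fence keeps the compute queue from starting before its inputs are ready, and the second keeps the graphics queue from reading results that the compute queue has not written yet.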

Async Compute

• Win of 5-10% on AMD

• No difference on Nvidia 

• Working with Nvidia to get this fixed 

• Hard to tune. 

• Too much async work can make it a penalty 

• PC has lots of configurations 



Resource Transitions

• D3D12 Transitions are complicated

• We don't want to have to worry too much about that when writing code

• We annotate render code with transitions

• Simplified version of D3D12 Transitions

• Only two transitions

• To View defined state

. UAV for UAVs
. RTV for RTVs
. DSV for DSVs

• To Read

#define SUBRESOURCE_TRANSITION_SRV(pDeviceContext, pSRV) ...
#define SUBRESOURCE_TRANSITION_RTV(pDeviceContext, pRTV) ...
#define SUBRESOURCE_TRANSITION_RTV_READ(pDeviceContext, pRTV) ...
#define SUBRESOURCE_TRANSITION_DSV(pDeviceContext, pDSV) ...
#define SUBRESOURCE_TRANSITION_DSV_READ(pDeviceContext, pDSV) ...
#define SUBRESOURCE_TRANSITION_UAV(pDeviceContext, pUAV) ...
#define SUBRESOURCE_TRANSITION_UAV_READ(pDeviceContext, pUAV) ...

• One exception per resource

• Subresource implied by view
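
What such annotations might expand to is sketched below, purely as an assumption about the idea rather than the engine's macros. `TransitionTracker`, `State` and `Barrier` are hypothetical names, subresources are plain integers, untracked subresources are assumed to start in the read state, and the tracker is assumed to be used from a single thread.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// The simplified model: one "read" state plus the view-defined write states.
enum class State { Read, RenderTarget, DepthWrite, UnorderedAccess };

struct Barrier { uint32_t subresource; State before; State after; };

class TransitionTracker {
public:
    // Records a barrier only when the state actually changes.
    void transition(uint32_t subresource, State wanted) {
        auto it = states_.find(subresource);
        const State current = (it == states_.end()) ? State::Read : it->second;
        if (current == wanted) return;  // skip unnecessary transitions
        pending_.push_back(Barrier{subresource, current, wanted});
        states_[subresource] = wanted;
    }

    // Hands back all pending barriers so they can be issued as one batch.
    std::vector<Barrier> flush() {
        std::vector<Barrier> batch;
        batch.swap(pending_);
        return batch;
    }

private:
    std::unordered_map<uint32_t, State> states_;  // missing entry means Read (assumption)
    std::vector<Barrier> pending_;
};
```

Collecting barriers and flushing them together is one way to get the batching that a single-threaded transition model makes easy.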

Resource Transitions

• We only allow transitions on one thread

• No resource state patching

• Batching & optimization of changes becomes simple


Resource Transitions

• Slow when GPU bound? Check your transitions

• Don't do unnecessary transitions

• Use COMMON to upload

• VB, IB, Read only Textures

• Never use COMMON or GENERIC_READ for

. Render Targets
. UAVs

Memory Budget

• You should care about memory budget

• Can change dynamically

• If you fail to follow it, Windows will enforce it

• Resources will be pushed out of video memory

• No Resource Priorities in DX12

• They exist for the driver

• Usually this is enough

• We had problems with UAVs being pushed to system memory

• Maybe we'll be able to set priorities in the future?

MakeResident & Evict

• The official guideline is:

• Use MakeResident & Evict to ensure you are within the memory budget

• Evict

• Makes a resource unusable

• Lazy, Never blocks 

• But budget updated immediately 

• MakeResident 

• Makes an Evicted resource usable 

• Synchronous 

• Time proportional to size of resource 



The MakeResident/Evict Rabbit Hole

• Complicated

• Hard to get right

• Easy to get wrong

• For Optimal Eviction

• All resources are committed resources

• Wastes huge amount of memory (1 GB!)

• Committed resources are 64 KB aligned

• Compromise:

• Resources >= 64 KB -> Committed

• Resources < 64 KB -> Suballocated in multiple heaps

• VB/IB in system mem on low end hardware 

• Only Evict once per frame 
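
The size threshold in the compromise, and the waste it avoids, can be written down in a few lines. This only restates the rule from the slide; `choosePlacement` and `committedWaste` are hypothetical helper names, and the waste figure assumes a committed resource occupies a whole multiple of 64 KB.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kCommittedAlignment = 64 * 1024;  // 64 KB, from the slide

enum class Placement { Committed, Suballocated };

// Large resources get their own allocation; small ones share heaps.
constexpr Placement choosePlacement(uint64_t sizeInBytes) {
    return sizeInBytes >= kCommittedAlignment ? Placement::Committed
                                              : Placement::Suballocated;
}

// Bytes lost to rounding up if a resource of this size were committed on its own.
constexpr uint64_t committedWaste(uint64_t sizeInBytes) {
    return (sizeInBytes + kCommittedAlignment - 1) / kCommittedAlignment *
               kCommittedAlignment -
           sizeInBytes;
}

static_assert(choosePlacement(4 * 1024) == Placement::Suballocated, "small resource");
static_assert(choosePlacement(64 * 1024) == Placement::Committed, "large resource");
```

A 4 KB buffer committed on its own would strand 60 KB, which is how many small resources can add up to a very large total.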




D3D11 vs D3D12

[Charts: Frame Time, Relative to DX11 (series: DX12, DX11)]

Questions?

• jonasm@ioi.dk

Thank you for listening






